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Rare Disease Physician Targeting: A Factor Graph 

Approach 


Abstract 

In rare disease physician targeting, a major challenge is how to identify physicians who 
are treating diagnosed or underdiagnosed rare diseases patients. Rare diseases have extremely 
low incidence rate. For a specified rare disease, only a small number of patients are affected 
and a fractional of physicians arc involved. The existing targeting methodologies, such as 
segmentation and profiling, arc developed under mass market assumption. They arc not 
suitable for rare disease market where the target classes are extremely imbalanced. The 
authors propose a graphical model approach to predict targets by jointly modeling physician 
and patient features from different data spaces and utilizing the extra relational information. 
Through an empirical example with medical claim and prescription data, the proposed 
approach demonstrates better accuracy in finding target physicians. The graph representation 
also provides visual interpretability of relationship among physicians and patients. The model 
can be extended to incorporate more complex dependency structures. This article contributes 
to the literature of exploring the benefit of utilizing relational dependencies among entities in 
healthcare industry. 

Keyword: : factor graph, probabilistic graphical model, pharmaceutical, rare disease, 
targeting 
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Introduction 

A rare disease, also known as orphan disease, has very low prevalence rate that affects only a small 
percentage of population. In some extreme cases, for an example, Hutchinson-Gilford progeria 
syndrome only affects a few dozen children (Field et al. 2011). Given the low patient number and 
high cost of bringing new product into the market, it is essential for the pharmaceutical companies 
to develop budget friendly and efficient marketing approaches. 

Pharmaceutical companies use communication channels such as in-personal detailing or non- 
personal digital channels to raise disease awareness and deliver promotional messages to 
physicians (Narayanan and Manchanda 2009; Dong et al. 2009; Manchanda et al. 2004). Detailing 
channel is also used for patient education and health promotion. There is literature to show 
that academic detailing can help increase disease detection and provide early disease intervention 
(Cameron et al. 2010; Fox et al. 2008). Delivering educational messages related to the rare disease 
diagnosis and treatment can raise disease awareness and help diagnosis. When to launch a new 
orphan drug into the market, one practical question arises: under limited budget and resource, how 
can managers target only those relevant physicians instead of reaching out to vast majority? Many 
physicians treating rare disease cannot be directly identifiable from database containing diagnosis 
or prescription treatment data. Marketers need to use some predictive analytics to identify those 
individuals. On the other hand, its such a small affected population compared to that in common 
condition diseases. Finding undiagnosed patients and potential physicians is like looking for a 
needle in a haystack. 

To identify physicians having rare disease patients is a challenging task from many perspectives. 
First of all, some of the rare diseases are very hard to diagnose. Patients affected by rare diseases 
may be free of symptoms for a long time or the symptoms can be hidden behind common 
conditions (de Vrueh et al. 2013). On top of that, physicians rarely encounter such patients in 
their daily practice and have limited experience or knowledge. They may be unaware that they 
have undiagnosed patients with such rare diseases. Even for the diagnosed patients, the medical 
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record database may not capture all information because of the coverage or missing diagnosis 
codes in the system. These existing targets cannot be directly identified. But every potential target 
is valuable because of the market itself is very small. 

The current approaches for identifying targets in database marketing or target marketing are 
developed under assumption for large markets (Hughes 2005; Van den Poel et al. 2003). They 
prioritize customers by defined value. For example, one method is to derive customer targets 
through segmentation or clustering framework (DeSarbo et al. 2008). These approaches dont 
perform well in rare disease market where the class of interest is small and extremely imbalanced 
(Akbani et al. 2004). The physicians, especially primary care physicians, who treat rare disease 
patients have similar characteristics or patient profiles as other physicians. Segmentation and 
profiling methods group all look-alike physicians together and do not differentiate well for the 
true rare disease physicians. The researchers can also use supervised classification approaches. 
But the traditional classification models have difficulties to predict smaller classes well (Chawla 
2009). 

The objective of this article is to explore new ways to improve the targeting accuracy in the unique 
rare disease markets. Our motivation comes from the desire to enhance unsatisfactory results in 
the rare disease targeting practice. The traditional statistical models yielded targeting list with high 
false positive rate. One needed to reach out to a larger number of non-targeted physicians in order 
to cover some true rare disease physicians. Those models cannot effectively utilize dependencies 
among entities. We hope to improve the targeting accuracy in a new model by explicitly using 
the extra physician and patient relationship. In our rare disease targeting demonstration, there 
are two distinct feature spaces: patient and physician. These two spaces can be bridged by 
physician-patient treating relationship. But features and data dimensionalities from these spaces 
are completely different. A physician typically treats multiple patients with various conditions. 
The traditional classification model requires aggregating patient data into physician level before 
building physician classification model. Similarly, one can also build patient level prediction 
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model and then link patient predicted flags to physician targets in a separate step. But either of 
the methods throw away relational information. Due to the extremely imbalanced classes, these 
models tend to generate high false positive predictions. 

We propose a graphical model method to structurally model physician patient features together and 
utilize the additional relational information to improve target identification accuracy. Our hope is 
that the information from dependencies among physicians, patients, and between physician-patient 
can contribute to the accuracy gain. We first formulate the physician classification problem in 
a probabilistic joint distribution. The proposed model depicts the dependence structure among 
physicians and patients and relaxes i.i.d. assumptions. Then we use factor graph message passing 
algorithm to predict physician and patient labels. 

We organized the remaining of the article in the following way: in the next section, we discuss 
background of rare disease and review some related work. Then we describe the data source used 
for model development. Next we formulate the problem using graph representation that designed 
for rare disease physician identification. With the proposed model, we present the factor graph 
algorithm for target label prediction, and parameter estimation. Finally, experiments with real data 
as well as concluding remarks will be given in the last two sections. 


Related Work 

The United States Rare Disease Act of 2002 defines it as a disorder affects fewer than 200,000 
people. Other countries use similar definitions for example, European Organisation for Rare 
Diseases (EURORDIS 2005) defines rare disease prevalence rate to be less than 1 in 2,000 
people. The National Organization for Rare Disorders (NORD) at the National Institutes of 
Health (NIH) identified about 7,000 rare diseases. Collectively, rare diseases can affect 25-30 
million Americans (NORD 2013). Developing orphan drugs for rare diseases represents a unique 
opportunity to pharmaceutical industry. In marketing, rare diseases bring different challenges than 
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mainstream products. The healthcare environment calls for innovative methodologies to address 
these challenges. 

Many rare disease patients are undiagnosed or misdiagnosed. To identify these patients and their 
treating physicians using predictive model is a challenging task. One major challenge comes 
from the imbalance of the classes in the dataset. Classic statistical models or standard machine 
learning algorithms are biased toward larger classes in prediction. If not treating and measuring 
properly, most of rare disease patients will be mis-classified as the other major classes albeit the 
overall accuracy rate may appear to be high. Oversampling and undersampling are commonly used 
techniques to overcome the imbalance problems (Chawla 2009; Rahman and Davis 2013). The 
focus of these algorithms is to boost signal and reduce prediction bias by reusing existing samples. 
Our proposed models address the problem from different angle. Instead of modeling separately 
in patient or physician space, we develop probabilistic graphical models to take advantage of the 
extra relational information among different entities. 

There are very few literatures addressing rare disease physician targeting challenge. Some use 
predictive classification model to identify targets. Chawla and Davis (2013) proposed using 
collaborative filtering to predict personalized disease based on patient history, phenotype and 
comorbidities. Collaborative filtering leverages the similarity among patients to profile disease risk 
for each individual patient. Santoro et al. (2015) used hierarchical clustering to characterize and 
identify rare disease topologies. They apply random forests to derive the most important variables 
for profiling. The existing rare disease classification method works in either patient or physician 
data space. We propose a graphical model method to explicitly link physician and patient features. 

Probabilistic graphical model (Bishop 2006; Roller and Friedman 2009) uses graphical diagram 
to visualize the statistical dependence among random variables. It encodes the problem into a 
joint probabilistic distribution over a high dimensional space. In a complex system, the statistical 
inference is computational demanding. Because it requires multidimensional integration for 
unknown variables. Factor graph (Kschischang et al. 2001), a major class in graphical models, 
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can explain the dependencies among interacting variables. By factoring the global multivariate 
functions into several local functions, the factor graph can efficiently perform statistical inference 
through messaging passing algorithm. It is widely applied in statistical learning, signal processing 
and artificial intelligence (Loeliger et al. 2007). In the next sections, we will describe the data 
assets, formulate the probabilistic model and develop factor graph for physician targets prediction. 


Data Description 

We extract data from from IMS Health longitudinal prescription (Rx) and medical claims (Dx) 
database. To limit the scope, we focus on only one particular rare disease market. The selected 
rare disease is an inherited blood disorder caused by genetic defect. It is estimated to affect about 
1 in 50,000 people according to Genetic Home Reference (GHR) from NIH. 

The Rx data is derived from electronic records collected from pharmacies, payers, software 
providers and transactional clearinghouses. This information represents activities that take place 
during the prescription transaction and contains information regarding the product, provider, payer 
and geography. The Rx data is longitudinally linked back to an anonymous patient token and can 
be linked to events within the data set itself and across other patient data assets. Common attributes 
and metrics within the Rx data include payer, payer types, product information, age, gender, 3-digit 
zip as well as the scripts relevant information including date of service, refill number, quantity 
dispensed and day supply. Additionally, prescription information can be linked to office based 
claims data to obtain patient diagnosis information. The Rx data covers up to 88% for the retail 
channel, 48% for traditional mail order, and 40% for specialty mail order. 

The Dx data is electronic medical claims from office-based individual professionals, ambulatory, 
and general health care sites per year including patient level diagnosis and procedure information. 
The information represents nearly 65% of all electronically filed medical claims in the US. All data 
is anonymous at the patient level and HIPAA compliant to protect patient privacy. 
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For model development, we pull the diagnoses, procedures and prescriptions at transaction level 
using study period from January 1, 2010 to July 31, 2015. In the rest of the article, we name a 
patient with the rare disease condition as “positive patient” and name the rest of them as “negative 
patient”. Similarly, let “positive physician” be a physician who treats at least one positive patient, 
and “negative physician” be a physician treating only negative patients. 

From the extracted data, we can positively identify 1,233 true rare disease patients with valid 
records such as gender, age and region. To boost the positive signal and model development, 
we construct a training and validation data by matching each positive patient with 200 randomly 
selected negative patients. The final patient data contains 1,233 positive patients and 246,600 
negative patients. The positive ratio in the training data is about 0.5%. 

Based on the linkable anonymous IDs in the patient data, we further pull data of physicians who 
have treated those patients in predefined selection period. Physicians and patients are linked if 
they have associated in at least one medical claim record in the selection period. We end up 
with 68,898 unique physicians in total and among those 8,346 positive physicians. There are 
1,463,030 physician-patient links stored in a separate database. On average each patient has visits 
5.9 physicians, and each physician has treated 21.23 patients. 

Up to this point the experiment data exhibits some challenges. First, the imbalance among positive 
and negative classes limits the performance of many common machine learning and statistical 
models like regression, support vector machine and decision trees. Second, the complicated 
relationships between patients and physicians make it difficult to directly generate meaningful 
features as model input from raw data. Third, the large amount of data calls for an efficient 
inference algorithm instead of naive marginalization. We’ll propose our model in the next sections. 
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Problem Formulation 

Given the data we just described, let us formally define the problem in mathematics. Consider 
a multi-agent system that consists of N physicians Ai, and M patients Bj, with i G Ma = 
{1,2, and j G Mr = {1,2, respectively. Each physician is associated with a 

feature vector z, : G M L representing physician Ai s features such as her specialty, age, gender, 
office location, etc. Similarly, each patient B, } is associated with a feature vector w j G M A 
denoting her features such as age, gender, diagnosis histories, etc. In a matrix form, let matrix 
Z G M 7VxL = [zi, z 2 , • • • , zn] t summarizes all the features of all the physicians, and let matrix 
W G M /Vx K = w . w 2 , • • • , w m ] t summarizes all the features of all the patients. 

Let x be patient label vector indicating if a patient has a specific rare disease. Specifically, 
x = [x\, x 2 , ■ ■ • , xm\ t G {0, 1} M is a binary vector with 0 or 1 entries. If Xj = 1 then patient Bj 
is positive, vice versa. Similarly, let us define the physician label vector as y = [yi, y 2 , • • • , y ; v] G 
{0, 1} A . Then by the definition of positive physician and negative physician, the studied physician- 
patient network can be illustrated as a graph G = ([ Ma,Mb]M ) i n Figure 1. In this graph, each 
square shape denotes a physician and each circle a patient; the red color denotes positive label and 
the blue negative label. Please note that other than the physician and patient’s labels and features, 
the patient-to-physician relationship is also known from the data. 

With these defined components, we can formulate the problem as follows. Given known patient 
features W , physician features Z and the physician-patient network G, for all i G Ma, find the 
estimate of y, that minimizes the mean square error, referred to as the Minimum mean square error 
(MMSE) estimate. It can be shown that this MMSE estimate has the form 

Vi = argminE(yj - y^ 2 

Vi 

= E {t/i|W, Z} 


where the first expectation is taken over both W, Z and , and the second expectation refers to the 
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Positive Patient 
Positive Physician 
Negative Patient 


Negative Physician 


Figure 1: This schematic depicts the physician-patient network 
expected value of y* with posterior distribution y(y,jW. Z). 

The goal is to find physician labels y, using patient and physician features such that the distance 
between y, and y, is minimized. By building probabilistic models, one can compute the posterior 
distribution p(y, |W, Z) and the prediction of physician label is the MMSE of y,. 


Proposed Model with Feature Engineering 

Next we show the derivation of the proposed model. To account for the dependencies among 
physicians and patients, we use Bayes network (Friedman et al. 1997) to build a model representing 
the joint distribution of all observed and latent variables 1 . Specifically, we show the explicit 
formulas of the joint distribution, which serves as the basis of the following section where we 
will convert the Bayes network to a factor graph. 

To depict the relationship between physicians and patients as well as the features and lables 

associated with them, we draw a four layer Bayes network in Figure 2. Using the definition from 

'Observed variables refer to the variables whose values are given, e.g., W and Z. Oppositely, latent variables refer 
to the variables with unknown value, like x and y in the model. 
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Figure 2: This graph depicts the relationship between random variables by employing Bayes 
network. Each node represents a random variable and each edges depicts a conditional probability 
distribution, y and Z represent physician labels and features; x and W represent patient labels 
and features, respectively. The whole graph yields a joint probability distribution of all the random 
variables. 


previous problem formulation, we can write the joint distribution of all the random variables in a 
factorized form as 


(1) p(W, x, y, Z) = p(Z|y)p(y|x)p(W|x)p(x), 

where physician features Z and patient features W are conditional independent given the 
corresponding physician or patient labels. Note that we remark the factors coming from either 
the patient side or the physician side. 

In equation 1, p{x) can be considered as a prior distribution of patient labels. Because the 
experiment data set is constructed by matching one positive patient with two hundred negative 
patients, we assign an a priori probability of one patient being positive to be 1/201. Equivalently, 
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we have 

M 

p( x ) = 

3 = 1 
M 

— J^ 7 y%r pyi )(l _ rjy( x i= 0 \ 

3 = 1 

with rj — 1/201 a priori. 

In the following subsections, we will develop the formulas for the rest of the components p( Z|y), 
p(W|x), andp(y|x). 

Patient Feature Distributions 

The patient feature distribution p(W|x) summarizes the patient information, which includes 
patient labels, demographics, diagnoses, procedures and prescription treatments. The condition 
distribution p(W|x) represents the patient feature distributions for both positive and negative 
classes. For clear demonstration, we divide the patient feature space into three categories: 

• w ; s : patient B / s self features including B / s gender, age decade and region. 

• w j e {0, l} 58 : a fifty eight by one row vector representing patient Bj ’s 58 clinical code 
indicators (Y/N). A clinical code can be either patient diagnosis, procedure or prescription 
filled.) 2 . 

• w f G Z 58 : a fifty eight by one row vector representing patient B / s clinical code frequency 

for the same 58 clinical codes. This feature represents how many times the event of 

prescription, procedure or diagnosis occurred during the study period. 

2 The size of the vector can be adjusted accordingly for other studies. 58 clinical codes here are specified for this 
experiment only. 
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By independence assumption between patients, we can get that 

M 

p(W|x) = JJp(wj|a;j) 

3 = 1 
M 

= Y[p(w?\Xi)p(wf\Xi)p(wf\wf, Xj) 

3 = 1 


In order to get this factor form, we make two assumptions here. First, we assume that the features 
as well as labels are mutually independent among patients. Second, given a patient label x 3 , this 
patient’s self features w ; s are independent with her clinical code features, i.e., and wj\ 

Specifically, we model 


( 2 ) p(wj\xj) = 

i=i 

where Vj G Mb, 

• w ' | G {0, 1}: patient gender 


I Ber(^) if x 3 = 1 , 
I Ber(r/“) if xj = 0. 


• w^ 2 G {0, 1, 2, • • • , 9}: patient age decade 


P( w j, i\ x j) 


I Cate ( 7 ^) if x 3 = 1, 
I Cate ( 7 °) if Xj = 0. 


where Cate( 7 ) means categorical distribution parametrized by vector 7 and the element-wise 
sums of both 7 ° G [0, l ] 10 and 7 ^ G [0, l ] 10 are one. 


G {1,2, 3, 4}: patient region whose value corresponds to ‘SOUTH’, ‘WEST’, 
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‘MIDWEST’, ‘NORTHEAST’ 


p( w | 3 kj) 


| Cate(7g) 
[ Cate (7°) 


if Xj = 1, 

if Xj = 0. 


where 7° e [0, l] 4 , and 7* e [0, l] 4 respectively. 

For patient clinical code indicator and frequency features, p(\Vj\xi) andp(w| |wj", a:*), we propose 
modeling by Bernoulli and Poisson distribution respectively. As the clinical codes have been 
grouped into 58 disjoint classes, here we further assume that 


58 


P( w f\ x i) = T[p( w j 

q=l 

58 

p(wf\wf,Xj) = J \p{w f j 


q\ X j)i 


^ \w^ ) 
9 1 w j,q» 


q = 1 


where Wj d and w? d denotes the c/th element of vectors w j and w|' respectively. 

Let 77° = ,<iv] T e [ 0 , l] 58 and rj\ = [77^,77^,. 

parameter vectors indicating the probability of getting positive clinical indicator for positive and 
negative patient classes respectively. Then we have the following distribution 


p d)N } T € [0, l] 58 be the 


j Ber (^, g ) if x i = \ 

( Ber « g ) if^ = 0. 


To compute p(wj \wf ), we propose Vj G Mb, 


| p oi(A^) if w lq = f, 

| P °i(^d,q) ifw j, q = 0. 


where Poi(A) symbolizes a Poisson distribution parametrized by A, with \ d q > 0 and A^ > 0 . 
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Physician Feature Distributions 


Similar to patient feature formulation, we can create the conditional distributions of physician 
features given physician label p(Z|y). We separate physician features Z into two parts. The first 
part, z 9 , accounts for physician general demographics such as specialty, gender, patient count, and 
the state where his or her office locates. The second part, z D , accounts for physician’s overall 
office claims histories. For this part, we create maximum, minimum, sum and average of observed 
number of claims for each physician. 

Specifically, we model 

N 

p(% |y) = 

i= 1 
N 

= 

i= 1 

where Vi e JVa, the above distribution can further be factorized as 

3 

(3) PfaflVi) = Il^fM 

i=i 


In Equation 3, we have 

• physician gender zf , e {0,1}: p(zf 1 \y i ) ~ Ber(o; s ), where if y l — 1, u g — (Jg \ else 

(o) 

COg = COg ■ 

• physician specialty code zf 2 e {0, 1, • • • , 189}: p(zf 2 \yi) ~ cate( 7 p ) 

• physician patient count zf 3 e N+: p(zf 3 \yi) rs_/ Pois(A c ) 

where in all of above distributions, we will have separate sets of parameters for both positive and 
negative classes. 

Let z[\ G M be the maximum, minimum, summation, and average number of office claims of 
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physician i for his/her patients, for l e {1,2,3, 4}. Also let z D = [zf\ , zf 2 , z[^Y symbolize all 
the claims related features, then we assume that the claims related features follow a joint Gaussian 
distribution represented by, 

, Dl , f JV(A«i, Si,) if!/, = 1, 
p(z Ik) ~ < 

[jV( M S,SS>) if Vi — 0. 

where N(/i, S) denotes multivariate normal distribution parametrized by fx and X. 


Physician Labels Joint Posterior Distribution 

The last component in the joint distribution is p(y |x), which represents the conditional distribution 
of physician labels given patient labels. It describes the relationships between patient labels and 
physician labels from a probabilistic point of view. Intuitively, p(y|x) demonstrates under what 
conditions a physician can be regarded as positive or negative. Let x' be the positive labels of the 
patients associated with physician A,. Note that physician labels are determined by her patient 
labels only, then the p(y|x) can be factorized as 

N 

p( yl x ) = 

i= 1 
N 

( 4 ) = YlpiVil* 1 ) 

2—1 

where the second equal sign is due to the fact that one physician’s label is conditional independent 
with the labels of patients not in her set given her patients’ labels, i.e., [Ay,. x J jx' ) = 

p(?/i|x*)p(x^jx*), Vj Y i- 

In Equation 4, the p(y i \x. z ) is given by 

(5) p(y*ix*) = (i - n i ( x 3 = o)) fn^=°) 

V jeM / Vie Ni 
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Predictive Inference for Physician Targets 

So far we have derived explicit formulas for each factor form in Equation 1. The joint distribution 
p(x, y , W, Z) is just the product of all factors. In this section, we will follow the joint distribution 
formula to predict physician labels, y . To compute the predictive results, we first convert the Bayes 
network to a factor graph and then apply the prediction algorithm of inferring the physician labels 
given known features. We’ll show the rationale and details below. 

According to the Bayes rule, the joint posterior of y is given by 

(6) p(y|W,Z) = ^p(y,W,Z) 

= |EE-E p(x. y . W. Z), 

£ 1=0 £ 2=0 XM= 0 

where Z = J w f z p(x, y, W, Zjr/Zr/W denotes the normalizing constant for posterior of y. Here 
we remark that the Z is computationally expensive but it will be shown soon that this constant 
doest not need to be computed. The M order summation after the second equal sign is because one 
need to marginalize x to compute p(y|W, Z). Following Equation 6, the marginalized posterior 
of y, has the form 

i 

(V) p{ yi |W,Z) = J>(y|W,Z), 

y i o 

where J^y_ i= o denotes a summation over all random variables expect y, in y. 

Note that computing p(y i |W, Z) implemented in Equation 6 and Equation 7 with brutal force will 
require marginalizing M + N — 1 binary variables, whose computation cost increases exponentially 
as JV or M grows. In our studied data set, we have N = 68, 898 and M = 247, 833, which 
prohibits us from using traditional marginalization methods. To compute the p(^|W, Z) in an 
efficient manner, we introduce the factor graph model and its associated variable marginalizing 
algorithm which is called message passing algorithm. 
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Factor Graph and Message-passing algorithm 


O - 

1 

<t> 

0— 

a,2 ^ _ 0b, 2 


#-© 


4 ^ 0,2 


a-® 



Figure 3: Factor graph depicting the relationships between variables nodes and factor nodes. 

In our proposed model, the doctor labels are conditional independent with each other given the 
patient labels; and the patient labels are mutually conditional independent if the doctor labels are 
given. Due to such conditional independence in the proposed model, a factor graph will allow us 
to solve the original giant marginalization problem by exploit the “Divide and Conqure” strategy, 
which will be shown in the rest of the sections. 

A factor graph (Kschischang et al. 2001) is a type of probabilistic graphical model that contains two 
types of nodes: Variables, which can be either evidence variables whose value is known, or query 
variables whose value should be predicted or marginalized. Factors, which define the relationships 
between variables in the graph. In the proposed model, patient features W and physician features 
Z are evidence variables while patient labels x and physician labels y are query variables. 

The factor graph of the proposed model is drawn in Figure 3, where each circle represents a variable 
node and each square represents a factor node. For the variable nodes, they have the exact same 
definition as the variables in the Bayes network in Figure 2. For the factor nodes, there are three 
classes of them, <©, o b and <i) c . For each i e J\f A , o aj connects z, and ?/,, <j> bti connects and the 
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Xj, where j G M t . For each j G Mb, 4>c,j connects x 3 and the W ; . 

By noting the factorization form in Equation 1, we can define the specific definitions of (f) a , ©/, and 

4>c by 


0a, i (Zi, j yi) 

= P(Zi|j/i), 

Vi G M a , 

faAyi’X 1 ) 

= p(yi,x l ), 

Vi G M a , 

<M Wj, Xj ) 

= piW, Xj). 

Vj G Mb 


Using the explicit forms of p(z i \y i ), p(y.i, x' ) andp(wj \xj), one can show that the joint distribution 
p(x, y . W, Z) can be factorized as: 

N M 

(8) p(x, y,W,Z) = Jj0a,i(Zi,S/i)06,i(yi,X*) 

*=1 j=l 

With the factor graph for our proposed model in Figure 3, now we can apply the Message-passing 
algorithm (Kschischang et al. 2001) for marginalization. The topology of the factor graph in Figure 
3 has no loop and it is a tree structure graph. Considering that Kschischang et al. (2001) has shown 
exact marginalization results will be achieved in graphs without loop, we can get the exact value 
of p(yi |W. Z) by the Message-passing algorithm. 

Message-passing algorithms that operate on factor graphs aiming at calculating the marginal 
distribution for each unobserved node, conditional on any observed nodes. Here, we will show 
how to compute p(y i: W, Z) by the sum-product message-passing algorithm. The idea is that 
instead of computing p(y, |W, Z) by Equations 6 and 7 directly, it is much more efficient to first 
solve the p(y j, W. Z) by marginalizing all x and y_, variables using sum-product algorithm, then 
compute 

pjy» w,z) 

Ei=oP(j/i» w » z ) 


p{Vi |w,z) 
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where the summation merely takes over a binary variable y*. 

As is shown by the factor graph topology in Figure 3, it has more than one components, whereas 
each components of it has a tree structure. Kschischang et al. (2001) show that the sum-product 
algorithm yields an exact result of p{y i: W, Z) by exact 2 L message passings, where L is the 
number of edges in the factor graph. According to the algorithm, there are two types of messages. 
When the message is from a factor node s to a variable node v, the message is a probability 
distribution given by 


/A— ^ ^ (p.s {K ) n f^U—tS (tt) 1 

u£Afs\v u£Afs\v 

where JV S represents all variable nodes in the neighbor set of node s, and node u is a node in node 
s’s neighbor set but not equal to v. On the other hand, the message from a variable node u to a 
factor node s is given by 


P’U— »s(^0 P’Ul— 

LU£JV U \S 

where uj denotes a node in node u ' s neighbor set but not equal to s. 

In the factor graph of the proposed method, one can show that the message from variable nodes 
to factor nodes are easy to compute because it is just the production of few functions. The 
computation cost of messages from factor nodes to variable nodes are seemingly large but one 
can show that when the summation with respect to variable w or z, the summation result will be 
one. This is because 0 a and <i) c are probability density functions of w and z respectively. When the 
summation is over variable x' or y,, the computation cost is still small because for every 0 i, node, 
it only connects with a few variable nodes. 
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Parameter Learning 

Other than the predictive inference algorithm, we will show parameter estimation for the proposed 
model next. From machine learning point of view, the parameter estimation process can 
be understood as a training stage, during which the computer is trained to make meaningful 
predictions for the variables of interest, e.g.,the physician label in our problem. Following we 
will present the method and results of parameter estimation for the proposed model based on the 
training data set. In the remaining of this section, we assume that the values of the observed 
variables are from the training data set. In order to find the maximum likelihood estimate of the 
parameters, we aim at solving the following optimization problem: 

(9) 6 = argmax p(W, x, y, Z; 6) 

e 

= argmax L(Z, y; a) + L(W, x; (3) 

a.,f3 

where L(Z,y;a) = log(p(Z|y; «)), L(W, x; /3) = log(p(W|x; /3)), and where a and (3 
symbolize all parameters in physician and patient feature distributions, respectively, and the second 
equal sign is because of the joint distribution can be factorized as in 1. Here we remark that since 
there is no unknown parameters in p(y|x) and p(x), both of these two distributions do not play a 
role in the objective function. Because the objective function is separable, we can estimate ol and 
1 3 by maximizing L (Z . y : a) and L(W, x; j3) respectively. Next, we will derive the explicit form 
of the parameters. 

Patient Feature Parameters 

By Equation 9, we compute the maximum likelihood estimates of the parameters and present 
the results in Table 1. The table shows that the estimates from both classes are mostly close, 
which again reveals the challenge of our task. Recall that the parameters are used to determine the 
conditional distribution of patient features given patient labels, i.e., p(W|x). 
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Table 1 : Maximum Likelihood Estimate of Patient Features 


Parameter 

Negative class 

Positive class 

Prob. of patient gender being male rf^ , r/ ( ' 

0.3812 

0.2689 

Age decade distribution 7 °, 7 * 

See Figure 4 


Region distribution 7 °, 7 ) 

[0.394, 0.223,0.217,0.166] 

[0.316, 0.303, 0.194, 0.186] 

Clinical code indicator parameters rf d , rj d 

See Figure 5 


Clinical code frequency parameters A®, A^ 

See Figure 6 



In table 1, the first line lists the parameter estimation for a patient gender being male in both 
negative and positive classes. From the result, we can see that the positive class patients have lower 
probability to have male gender (0.27 vs. 0.38). By the third line, the patient region distributions 
are close between two classes, which provides little information to differentiate the rare disease 
patients. This is a common phenomenon in a hard problem like this. We will need to utilize ah 
information collectively to get better prediction. 


0.20 



Age decades 

Figure 4: Estimation results of ~/ a , patient age decade parameters 


Similarly we plot the probability mass function of patient age decades for both classes in Figure 4. 
The age decade 0 denotes age from 0-9, 1 denotes 10-19, etc. The results from the training data set 
show that positive and negative classes have different distributions. If a patient is positive, then he 
or she is most likely to be in fifties than in other age decades, whereas if the patient is not positive, 
the age is most likely to be the thirties. At the same time, we observe that the ratio of positive 
patient among patients in thirties is larger than this ratio of positive patients in fifties. 
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Clinical codes 

Figure 5: Estimation results of r] d , patient clinical code indicator parameters. The x-axis index 0 
to 57 represents the 58 types of clinical codes. 


To demonstrate the results of patient clinical code indicator parameters 77 , we estimate the 
likelihood of predicting a positive label for both positive and negative class of patients. These 
estimates are obtained by computing the positive patient ratios in both classes. The results are 
plotted in Figure 5, where we can see that for certain clinical codes, positive patients have a much 
higher probability to get a positive result than the negative patients. In particular, we list the 
estimated values of top five clinical codes with largest relative differences between classes in Table 
2. These top clinical codes represent clinical procedures, prescriptions and diagnoses. 


Table 2: Clinical Codes Parameter Estimation - Top 5 


Clinical codes 

Negative class rf d 

Positive class rj d 

Chronic idiopathic urticaria 

0.0031 

0.0592 

Epinephrine 

0.0280 

0.4184 

Personal history of Allergy 

0.0054 

0.0357 

Allergy/ Anaphylaxis/Urticaria 

0.0482 

0.2685 

Laryngoscopy 

0.0152 

0.0414 


The last line of parameters X d in Table 1 are two 58 x 1 column vectors. They are the parameters 
of the Poisson distribution which models the patient clinical code frequency features in each of the 
58 clinical codes. Since the maximum likelihood estimate of Poisson distribution is the sample 
mean, we plot the parameter means for both positive and negative classes in in Figure 6 . We can 


23 



Figure 6: Estimation result of \ ( i, clinical code frequency feature parameters, where the x-axis 
index from 0 to 57 represent the selected clinical codes. 

find that the clinical code frequency features are similar for both positive and negative patients. 
This implies that patients with this rare disease condition look like all other patients in clinical 
code frequencies. Again these features may not have high predictive power by themselves. 

Physician Feature Parameters 

We list physician parameter estimates in Table 3. These feature parameters include physician 
gender, patient count, specialty and number of office claims related. The first set of parameters rf f \ 
and rfl depicts the estimated probability of a physician gender to be male conditional on his positive 
or negative label. The second set of parameters A° and A', are for patient count distribution. Again, 
because of the Poisson assumption, these parameters can be computed using by the patient count 
mean for both of the classes. From the results, a physician who has positive label treats more 
patients (29) than a physician who hasn’t (20). 

We plot all 189 specialties estimation results in Figure 7. Some of the specialty physicians are more 
likely to have this type of rare disease patient(s) than the others. Table 4 lists the estimated value 
for the top 5 most common specialties. For example, among negative physicians 15.63% of them 
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Table 3: Physician Features Estimation Results 


Parameter Maximum likelihood estimate 


Prob. of physician gender being male rf e 


0.8108 


vl 


0.7975 


Patient count mean X° c 


20.1514 


A l 


29.0939 


Specialty distribution 7° 


See Figure 7 


7 p 




Claims-features mean fi° d [0.001 , 

0.006 , 0.013 , 

-0.026] 1 

[- 

0.007 , 

-0.042,-0.097,0.191] 


" 0.977 

0.098 0.820 

0 . 727 " 

Claims-features covariance T,° n 

0.098 

1.108 0.246 

0.136 

0.820 

0.246 0.997 

0.727 


0.727 

0.136 0.727 

0 . 897 . 


" 1.171 

0.066 0.952 

1 . 011 " 

E b 

0.066 

0.216 0.085 

0.055 

0.952 

0.085 1.013 

0.919 


1.011 

0.055 0.919 

1.704 


are in specialty Diagnostic radiology, whereas among positive physicians, this ratio is 25.18%. 


Table 4: Physician Specialty Estimation - Top 5 Most Common 


Specialty 

Negative class 

Positive class 

Diagnostic radiology 

15.63% 

25.18% 

Emergency medicine 

7.78% 

9.91% 

Cardiovascular disease 

8.57% 

6.02% 

Family medicine 

8.51% 

5.83% 

Anatomic/clinical pathology 

2.97% 

4.18% 


In the physician feature distribution section, we assume the claims related features such as 
maximum, minimum, sum and average number of claims follow a multi-variate Gaussian 
distribution. So the sample mean and the sample covariances are the maximum likelihood estimates 
of the fi d and E/> We show the results in the last four lines of Table 3. There is no direct 
interpretation for these estimation results. But collectively with all other information they can 
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Specialties 

Figure 7: Estimation result of A,/, physician specialty distribution, where the x-axis index 189 
specialties. 

contribute to the prediction improvement. 


Empirical Results 

To validate the proposed model, we will provide two experiments that show the performance of 
our method and its comparisons to the random forest method (Ho 1998). In the experiments, we 
use positive predictive value(PPV) vs. sensitivity analysis (Friedman et al. 2001), FI score and the 
Matthews correlation coefficient for performance evaluation. 

Since we form this rare disease targeting as a binary classification problem, the detector 
performance can be measured by the true positive(TP), false positive(FP), true negative(TN) and 
false negative(FN). Then the PPV (also referred as precision) is defined as PPV = TP / (TP + FN); 
and sensitivity (also called recall) is defined as sensitivity = TP / (TP + FP). Namely, PPV is the 
number of correct positive results divided by the number of all positive results, and sensitivity 
is the number of correct positive results divided by the number of the predicted positive results. 
When comparing two detectors, with same sensitivity, the better detector should have a larger PPV 
value. Similary, with identical PPV, the detector with larger sensitivity performs better. Obviously, 
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both the PPV and sensitivity are between zero and one. 


Another performance metric of binary classification’s accuracy is the FI score, which summarizes 


the PPV and sensitivity to a single real number. The FI score is defined as two times the harmonic 


mean of PPV and sensitivity, given by 


2 • PPV • sensitivity 


PPV + sensitivity 

1, where an FI score reaches its best value at 1 and worst at 0. 


. FI score have value between 0 and 


The third performance metric we use is the Matthews correlation coefficient(MCC)(Matthews 
1975). It is generally regarded as a balanced performance measure for binary classification which 
can be used even if the classes are of very different sizes. A coefficient equal to 1 means perfect 
prediction, 0 denotes random guess and 1 indicates total disagreement between prediction and 
observation. Noting that the data set used for validation is highly imbalanced, we use MCC as the 
third performance metric for its consistency in data balance. 

We select the random forest model as our benchmark method for results comparison. The 
random forest classifier has demonstrated its capability and robustness in a variety of classification 
problems (Liaw and Wiener 2002; Dfaz-Uriarte and De Andres 2006; Svetnik et al. 2003). For this 
benchmark, we specify the random forest with 200 decision trees. The random forest model can 
use the same physician level features. But unlike the proposed model, it cannot incorporate patient 
and physician dependency directly. For each physician, we average all patients records that link to 
this physician and create similar patient features at physician level. The benchmark random forest 
is implemented through the Python package scikit-learn (Pedregosa et al. 2011). 


One-fold validation 

In the first experiment, we split randomly all the 68,898 physicians into one testing and one training 
data sets. The training set has 6,000 ( 10% ) physicians, where 713 physicians are positive and 
5,287 are negative. The testing set contains the rest 62,898 physicians including 55,265 positive 
and 7,633 negative physicians respectively. To avoid information leakage from the patient label in 
the training data set, any patient connected with any physician in testing data set shall be regarded 
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as a testing patient. As a result, according to the train-test split in the physicians, 161,681 patients 
are grouped into the training set, and 86,152 patients are grouped into the testing set. There are 
736 positive and 160,945 negative patients in the training set, and there are 497 positive and 85,655 
negative patients. 

In Table 6, we show the experiment result of the proposed result and the benchmark method, 
where the subscript ‘pm’ and ‘bnT correspond to ‘proposed method’ and ‘benchmark method’ 
respectively. We can see that the proposed method shows a much higher PPV than that of 
benchmark method. In particular, although the benchmark method hardly work (PPV less than 
0.05) when the sensitivity is greater than 0.35, the proposed method yields an acceptable PPV 
(greater than 0.2). Similarly, the FI score by the proposed method consistently higher than that of 
the benchmark method, especially when the sensitivity is greater than 0.35. In comparison with 
the best MCC of the benchmark method 0.3699, the best MCC of the proposed method is 0.4207, 
which implies a 13.7% performance increase. 

Table 5: Comparison results between proposed and benchmark methods in one-fold validation 


Sensitivity 

PPVpm 

PPV 6m 

MCC pm 

MCC pm 

FI score p m 

FI score fem 

0.2 

0.8036 

0.7826 

0.2403 

0.2354 

0.3206 

0.3190 

0.25 

0.7013 

0.5946 

0.2860 

0.2549 

0.3681 

0.3523 

0.3 

0.6143 

0.4825 

0.3168 

0.2720 

0.4026 

0.3699 

0.35 

0.4937 

0.0589 

0.3219 

0.1021 

0.4207 

0.1008 

0.4 

0.2917 

0.0449 

0.2686 

0.1010 

0.3379 

0.0807 

0.45 

0.2019 

0.0365 

0.2429 

0.1006 

0.2788 

0.0674 


To compare the two methods in term of PPV values, in Figure 8, we plot the sensitivity-PPV curves 
of both methods, where each dot on the curve represents a sensitivity PPV pair. We can see almost 
for every value of PPV, the proposed method has a higher sensitivity than the benchmark method. 
The area under curve serves as a single variable summary of the PPV performance, which is 0.3553 
for the proposed method, and 0.2857 for the benchmark method, suggesting that the overall relative 
performance increase is 24.31%. 
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Figure 8: Sensitivity with respective to PPV curves for proposed and benchmark methods in one- 
fold validation 

Ten-fold cross validation 


In the second experiment, we carry out the ten-fold cross validation with both methods to 
demonstrate the robustness of the proposed method. Specifically, the physicians are randomly 
partitioned into 10 equal size groups. Then the process of the first experiment is repeated for 10 
times, where in each time one single group of the physicians is retained as the testing set, while all 
the rest 9 groups are treated as training data set. Note that there is no overlap of any two testing 
data set. Consequently, 10 set of performance measure results are obtained. Table 6 shows the 
average PPV and MCC of both methods with respect to various sensitivity values. We can see that 
the proposed method outperforms the benchmark in term of both average PPV and average MCC. 

Table 6: Comparison results between proposed and benchmark methods in ten-fold validation 



Average PPV 


Average MCC 


Sensitivity 

Proposed method 

Benchmark method 

Proposed method 

Benchmark method 

0.2 

0.8046 

0.6469 

0.2354 

0.1943 

0.25 

0.6839 

0.4664 

0.2781 

0.2132 

0.3 

0.5732 

0.3342 

0.3007 

0.2158 

0.35 

0.4261 

0.0754 

0.2893 

0.1019 

0.4 

0.2332 

0.0690 

0.2282 

0.1153 

0.45 

0.1295 

0.0441 

0.1821 

0.0991 
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For more detailed comparison results, in Figure 9 we plot the ten fold results for both methods, 
where the the x-axis denotes the fold index from 0 to 9, the y-axis represents the MCC values. 
Comparing the left and right sub-figures, we can see that with identical color, i.e., with same 
sensitivity level, the curve on the right sub-figure is higher than that on the left figure, which shows 
that in all ten folds, the proposed method has a higher MCC. 


Benchmark method 

Sensitivity = 0.2 

— Sensitivity = 0.25 

Sensitivity = 0.3 

Sensitivity = 0.35 



0.00 

0123456789 
Fold index 


Proposed method 



0123456789 
Fold index 


Figure 9: Ten-fold MCC results of the benchmark and proposed methods 


Conclusion 

The goal of this article is to enhance rare disease physician targeting precision by exploiting 
extra structural relationship among physicians and patients. It is often a hard problem to identify 
rare disease treating physicians out of a large physician population. The difficulties come from 
many aspects such as extreme imbalance in classes and rare disease patients looking alike to 
common condition patients. We propose a graphic representation and probability model to join 
physician and patient features together with their network relationship. Through the graphical 
structure, researchers can visualize the connectivity among physicians and patients. The graphic 
representation provides clear interpretability of data entities and correlation. The proposed model 
also has flexibility to specify additional dependencies, add features or extend to more complicated 
network structure. In the empirical example, we use factor graph to predict physician rare disease 
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flag. We compare the results to random forest benchmark and find much improved targeting 
accuracy. Especially at high sensitivity level, the proposed method show significant improvement 
over benchmark. In practice, this means when a smaller target is needed under tight marketing 
budget, the proposed method can yield superior results by identifying more real targets. 

The literature shows that the graphic model methods such as factor graph or Markov random fields 
have the ability to specify and utilize these relationship to improve performance. In pharmaceutical 
marketing, there exists complex relationship among various stakeholders. But to our knowledge, 
there is limited effort to take advantage of such information. This article provide extra data point 
to demonstrate the usefulness of utilizing physician and patient structural link. 

Future research 

The presented case has certain limitations. First of all, we use binary classifier to predict rare 
disease physician identity. The outcome only indicates whether or not a physician having rare 
disease patients. The future research can extend to multi-class or continuous response such that it 
can predict how many rare disease patients for a predicted physician. 

In this study we use independent assumption among some of the feature to simplify model 
structure. We hope other studies can consider adding covariance to account for such feature 
correlations. Furthermore, we only include basic physician features such as specialty and location, 
etc. in our demonstration. But there are extensive historical treatment data available for each 
physician. It is possible to enhance the physician similarity link by incorporating those information 
in the graph nodes. A stronger physician similarity structure can potentially increase predictive 
power. It also calls for extra research on how to extra useful features from high dimensional 
physician history data. 

Although in our study the proposed model improves rare disease targeting accuracy to certain 
degree, the precision still leaves a lot to be desired. We hope to see more innovative methods to 
tackle this hard problem in the future research. 
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